Introduction

This project trains a model that predicts the win rate of a League of Legends game.

What is League of Legends?

League of Legends, also known as League, is a multiplayer online battle arena video game.

In a typical game, two teams of five players each face off against each other on a battlefield. Each player controls a champion with unique abilities and characteristics, and the goal is to destroy the enemy team’s base. The champions are divided into different classes, such as marksmen, mages, and tanks, and each player must carefully select their champion and coordinate with their teammates to succeed. The game is known for its complex strategy and intense team play.

Players can earn experience points and gold by defeating enemy champions and minions. These resources can be used to level up the player’s champion and purchase items that grant additional abilities and bonuses. As the game progresses, champions become stronger and gain access to more powerful abilities. The game also features a variety of different maps, each with its own unique layout and challenges. The outcome of a match is determined by a combination of skill, strategy, and teamwork.

How to win the game?

Destroying the base (aka the Nexus).

Each team’s base is heavily guarded by 11 turrets across three directions. These turrets deal massive damage to enemy champions and minions in range. To deal damage to the enemy base, a team must first destroy at least five of these turrets. This can be a challenging task, as the turrets are powerful and well-protected, so players must carefully plan their attacks and coordinate with their teammates in order to succeed.

In addition to the turrets, each team’s base also generates minions that automatically march towards the enemy base. These minions deal damage to enemy minions and buildings along the way. The enemy team will also attempt to protect their turrets and base by attacking your champions. If your champion is killed, they will respawn at your base after a certain amount of time. Alternatively, you can force your opponent to retreat by dealing a sufficient amount of damage to them. In this way, you can protect your own turrets and minions and pave the way for an attack on the enemy base.

Maximizing your chance to win.

Based on the information provided, a successful strategy in League of Legends may involve the following steps:

  1. Level up and earn gold by defeating minions or neutral entities on the battlefield.

  2. Use the gold to purchase items that increase your champion’s damage or defense.

  3. Attack and defeat enemy champions to earn even more gold and increase your team’s chances of victory.

  4. Focus on destroying enemy turrets in order to clear the way for an attack on the enemy base.

  5. Repeat steps 1-4, and prioritize steps 1-3 in a way that best supports your team’s overall strategy.

There are several common beliefs about what contributes most to a team’s win rate in League of Legends. Testing these beliefs using data analysis and modeling can help determine whether or not they are accurate. For example:

  • Maximizing champion kills may be seen as a key factor in winning a game, as killing a champion grants a large amount of gold.

  • The first kill of a game may be particularly important, as it grants a bonus of gold and can help a team gain an early advantage.

  • Vision score, or the ability to see and track enemy movements on the battlefield, may be critical to winning a game, as it can prevent teams from being ambushed and caught off guard.

  • Stealing objectives from the enemy team, such as dragons, rift heralds, or barons, may demoralize the opponent and increase the likelihood of mistakes.

The current model

The steps outlined above are a simplified summary of the game mechanics and strategy in League of Legends. In reality, there are many other factors that can impact a team’s win rate, such as stealing neutral entities, placing vision wards, or sacrificing for teammates by taking damage on their behalf. A predictive model can be useful for identifying the most important factors that contribute to a team’s success and prioritizing them in the decision-making process. In addition to making predictions, the model can also help identify which components have the largest impact on win rate and should be prioritized in strategy development.

Dataset

The dataset was downloaded from Kaggle; it records ranked games created on a single day on the Korean server.

Using a large dataset with millions of observations can provide valuable insights and improve the accuracy of predictive models. However, it can also be challenging to work with such a large dataset, especially if you are using a resource-intensive model like random forest or support vector machines with regularization. In this case, using a smaller subset of the data can help reduce the time and memory requirements of the analysis, while still providing useful insights. By carefully selecting a representative sample of the data, you can still obtain valuable results without overwhelming your computer’s resources. However, it’s important to keep in mind that using a smaller sample may result in less accurate predictions, so it’s a trade-off that should be carefully considered.

The workflow

  1. Clean the data by removing invalid rows and factoring certain predictors
  2. Perform exploratory data analysis to remove certain predictors
  3. Split the data and set up cross-validation
  4. Build logistic regression, linear discriminant analysis, quadratic discriminant analysis, and random forest models
  5. Evaluate the accuracy of the models

Load all the packages

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(corrplot))
suppressPackageStartupMessages(library(discrim))
suppressPackageStartupMessages(library(poissonreg))
suppressPackageStartupMessages(library(corrr))
suppressPackageStartupMessages(library(klaR))
suppressPackageStartupMessages(library(vroom))
suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(janitor))
suppressPackageStartupMessages(library(ggcorrplot))
suppressPackageStartupMessages(library(vip))
suppressPackageStartupMessages(library(ranger))
suppressPackageStartupMessages(library(kernlab))
suppressPackageStartupMessages(library(splitstackshape))
suppressPackageStartupMessages(library(xgboost))
tidymodels_prefer()

Cleaning and manipulation

Fix Encoding

As the author of the dataset suggests, the original encoding is cp949 (a Korean character encoding), so it is wise to convert it to UTF-8 first to avoid encoding issues.

Here I use the iconv command-line tool to change the encoding:

iconv -f cp949 -t utf-8 league_data.csv > league_data_utf8.csv
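The same conversion can also be done without leaving R, using base iconv(). A tiny round-trip sketch on an in-memory string (the Korean word is only an illustration, not taken from the dataset):

```r
# Demonstrate the cp949 <-> UTF-8 conversion in base R: encode a Korean string
# to CP949 bytes, then decode it back to UTF-8, as the shell command does for
# the whole file.
korean <- "\uc18c\ud658\uc0ac"                                 # "소환사" (summoner)
cp949_bytes <- iconv(korean, from = "UTF-8", to = "CP949", toRaw = TRUE)[[1]]
back <- iconv(list(cp949_bytes), from = "CP949", to = "UTF-8")
identical(back, korean)   # TRUE
```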

Loading the dataset

league_all_df <- clean_names(vroom("dataset/league_data_utf8.csv"))
## Rows: 2589340 Columns: 58
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): summonerName, win, teamPosition, visionScore, puuid, summonerId, ...
## dbl  (40): no, gameNo, playerNo, participantId, teamId, kills, deaths, assis...
## lgl   (6): gameEndedInEarlySurrender, gameEndedInSurrender, teamEarlySurrend...
## dttm  (2): CreationTime, KoreanTime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop variables that are not needed
league_all_df <- league_all_df %>%
    select(
        -c(summoner_name,
           puuid,
           summoner_id,
           creation_time,
           participant_id))

Remove invalid games

Remove rows that contain values that do not match the column types

Games that are remade because one or more players disconnected should be excluded from the analysis. Remake voting typically occurs within the first 1-2 minutes of a game, but games can also continue for a short period after the vote passes, so a slightly longer cutoff gives more reliable results. Here, any game lasting 360 seconds (6 minutes) or less is excluded, which removes remade games while keeping games that continued past the remake window.

# Inspect unique values of character columns to spot invalid entries
# for (col in colnames(league_all_df)) {
#     if (is.character(league_all_df[[col]])) {
#         print(unique(league_all_df[,col]))
#     }
# }
league_all_df <- league_all_df %>%
    filter(time_played > 360) %>% 
    filter(win %in% c("True", "False")) %>% 
    filter(team_position %in% c("TOP", "JUNGLE", "MIDDLE", "BOTTOM", "UTILITY")) %>% 
    filter(first_tower_kill %in% c("True", "False"))
    
# Fixing vroom type import issue
league_all_df$win <- as.logical(league_all_df$win)
league_all_df$first_tower_kill <- as.logical(league_all_df$first_tower_kill)
league_all_df$vision_score <- as.numeric(league_all_df$vision_score)
league_all_df$champ_level <- as.numeric(league_all_df$champ_level)
league_all_df$dragon_kills <- as.numeric(league_all_df$dragon_kills)

Remove NAs - listing column containing NA - remove rows containing NA

# Get columns with na values
na_columns <- names(which(colSums(is.na(league_all_df)) > 0))
print(na_columns)
## [1] "no"
# Loop through the columns and drop rows where that column is NA
# (.data[[col]] refers to the column named by the string; {{col}} would
# inline the string literal and never filter anything)
for (col in na_columns) {
    league_all_df <- league_all_df %>%
        filter(!is.na(.data[[col]]))
}

Take 10,000 observations per class, stratified on win (20,000 rows in total)

league_all_df <- stratified(league_all_df, "win", 10000)
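For reference, stratified() draws a fixed number of rows from each level of the stratifying column. A minimal base-R sketch of the same idea (toy data; only the column name win matches the real dataset):

```r
# Minimal base-R sketch of per-stratum sampling (what splitstackshape::stratified
# does here): draw up to n rows from each level of the stratifying column.
stratified_sample <- function(df, by, n) {
    idx <- unlist(lapply(
        split(seq_len(nrow(df)), df[[by]]),
        function(i) i[sample.int(length(i), min(n, length(i)))]))
    df[idx, , drop = FALSE]
}

set.seed(1)
toy <- data.frame(win = rep(c(TRUE, FALSE), c(300, 700)))
sub <- stratified_sample(toy, "win", 100)
table(sub$win)   # 100 per class
```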

Conversion and Factor

league_all_df <- league_all_df %>%
    mutate(team = case_when(
        team_id == "100" ~ "blue",
        team_id == "200" ~ "red",
    ))

Convert appropriate predictors to factor

str(league_all_df)
## Classes 'data.table' and 'data.frame':   20000 obs. of  54 variables:
##  $ no                             : num  371 110872 204044 258248 248430 ...
##  $ game_no                        : num  6e+09 6e+09 6e+09 6e+09 6e+09 ...
##  $ player_no                      : num  6 6 0 8 3 9 9 3 0 2 ...
##  $ korean_time                    : POSIXct, format: "2022-07-02 09:51:36" "2022-07-02 19:42:06" ...
##  $ team_id                        : num  200 200 100 200 100 200 200 100 100 100 ...
##  $ game_ended_in_early_surrender  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ game_ended_in_surrender        : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
##  $ team_early_surrendered         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ win                            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ team_position                  : chr  "JUNGLE" "JUNGLE" "TOP" "BOTTOM" ...
##  $ kills                          : num  2 5 3 2 23 1 3 9 5 5 ...
##  $ deaths                         : num  4 8 0 4 5 8 9 10 5 4 ...
##  $ assists                        : num  4 3 2 4 8 21 9 7 1 6 ...
##  $ objectives_stolen              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ vision_score                   : num  21 18 16 6 42 80 39 50 2 22 ...
##  $ baron_kills                    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ bounty_level                   : num  0 0 3 0 0 0 0 0 0 0 ...
##  $ champ_level                    : num  13 14 12 10 17 16 12 15 11 12 ...
##  $ champion_name                  : chr  "Ekko" "Shyvana" "Jayce" "Zeri" ...
##  $ damage_dealt_to_buildings      : num  0 535 1845 55 9341 ...
##  $ damage_dealt_to_objectives     : num  20234 21476 1845 2398 36837 ...
##  $ detector_wards_placed          : num  2 3 1 1 2 3 8 2 0 2 ...
##  $ double_kills                   : num  0 1 0 1 4 0 0 1 0 0 ...
##  $ dragon_kills                   : num  1 1 0 0 1 1 0 0 0 0 ...
##  $ first_blood_assist             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ first_blood_kill               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ first_tower_assist             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ first_tower_kill               : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ gold_earned                    : num  9183 10997 6918 7231 21702 ...
##  $ inhibitor_kills                : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ inhibitor_takedowns            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ inhibitors_lost                : num  0 3 0 2 1 3 3 1 0 0 ...
##  $ killing_sprees                 : num  0 2 1 1 3 0 1 2 2 1 ...
##  $ largest_killing_spree          : num  0 2 3 2 10 0 2 3 2 4 ...
##  $ largest_multi_kill             : num  1 2 1 2 2 1 1 3 1 1 ...
##  $ longest_time_spent_living      : num  572 688 0 457 621 786 323 448 325 559 ...
##  $ neutral_minions_killed         : num  152 111 0 1 39 4 0 4 4 4 ...
##  $ objectives_stolen_assists      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ penta_kills                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quadra_kills                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ time_c_cing_others             : num  15 6 8 7 7 61 49 23 5 4 ...
##  $ time_played                    : num  1563 1965 1036 1292 2153 ...
##  $ total_damage_dealt             : num  151959 149042 78922 58277 314578 ...
##  $ total_damage_dealt_to_champions: num  8564 19938 12433 11629 47639 ...
##  $ total_damage_taken             : num  26808 48266 7307 12195 29260 ...
##  $ total_heal                     : num  11859 7903 0 1550 5439 ...
##  $ total_heals_on_teammates       : num  0 0 0 258 547 0 0 0 225 0 ...
##  $ total_minions_killed           : num  18 44 138 118 247 58 34 40 123 129 ...
##  $ total_time_cc_dealt            : num  444 668 90 80 147 699 291 91 138 102 ...
##  $ total_time_spent_dead          : num  86 285 0 92 195 309 235 277 101 102 ...
##  $ total_units_healed             : num  1 1 0 2 3 1 1 1 3 1 ...
##  $ triple_kills                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ unreal_kills                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ team                           : chr  "red" "red" "blue" "red" ...
##  - attr(*, ".internal.selfref")=<externalptr>
league_all_df_factored <- league_all_df %>%
    mutate(
        across(
            c(
                win,
                team_position,
                first_blood_kill,
                first_blood_assist,
                first_tower_kill,
                first_tower_assist,
                team,
                champion_name),
            as.factor)
    )
# Releveling
league_all_df_factored <- league_all_df_factored %>%
    mutate(across(
            c(
                win,
                first_blood_kill,
                first_blood_assist,
                first_tower_kill,
                first_tower_assist),
            ~fct_relevel(., c("TRUE", "FALSE"))))
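Releveling matters because yardstick metrics treat the first factor level as the event of interest. A minimal base-R illustration of the same reordering, using relevel() instead of fct_relevel() on a toy factor:

```r
# After releveling, "TRUE" is the first level, so downstream metrics treat a
# win as the positive class.
x <- factor(c("FALSE", "TRUE", "TRUE"))
levels(x)                       # "FALSE" "TRUE" (alphabetical default)
x2 <- relevel(x, ref = "TRUE")  # move "TRUE" to the front
levels(x2)                      # "TRUE" "FALSE"
```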

EDA

Now I will start the exploratory data analysis. First, I will prepare the correlation matrix by dummy-coding the factored predictors.

The correlation matrix

print(colnames(league_all_df_factored))
##  [1] "no"                              "game_no"                        
##  [3] "player_no"                       "korean_time"                    
##  [5] "team_id"                         "game_ended_in_early_surrender"  
##  [7] "game_ended_in_surrender"         "team_early_surrendered"         
##  [9] "win"                             "team_position"                  
## [11] "kills"                           "deaths"                         
## [13] "assists"                         "objectives_stolen"              
## [15] "vision_score"                    "baron_kills"                    
## [17] "bounty_level"                    "champ_level"                    
## [19] "champion_name"                   "damage_dealt_to_buildings"      
## [21] "damage_dealt_to_objectives"      "detector_wards_placed"          
## [23] "double_kills"                    "dragon_kills"                   
## [25] "first_blood_assist"              "first_blood_kill"               
## [27] "first_tower_assist"              "first_tower_kill"               
## [29] "gold_earned"                     "inhibitor_kills"                
## [31] "inhibitor_takedowns"             "inhibitors_lost"                
## [33] "killing_sprees"                  "largest_killing_spree"          
## [35] "largest_multi_kill"              "longest_time_spent_living"      
## [37] "neutral_minions_killed"          "objectives_stolen_assists"      
## [39] "penta_kills"                     "quadra_kills"                   
## [41] "time_c_cing_others"              "time_played"                    
## [43] "total_damage_dealt"              "total_damage_dealt_to_champions"
## [45] "total_damage_taken"              "total_heal"                     
## [47] "total_heals_on_teammates"        "total_minions_killed"           
## [49] "total_time_cc_dealt"             "total_time_spent_dead"          
## [51] "total_units_healed"              "triple_kills"                   
## [53] "unreal_kills"                    "team"
# Plot correlation matrix
league_df_eda <- select(league_all_df_factored, -c(no, game_no, player_no, korean_time, team_id, champion_name, game_ended_in_early_surrender, game_ended_in_surrender, team_early_surrendered))
correlations <- model.matrix(~., data = league_df_eda) %>% 
    cor(use='complete.obs')
## Warning in cor(., use = "complete.obs"): the standard deviation is zero
correlations %>% 
    ggcorrplot(show.diag = T, type="full", lab=TRUE, lab_size = 2, tl.srt = 90)

The correlation matrix can be useful for identifying predictors that are correlated with losing the game. However, it’s important to note that some of these predictors may be highly collinear, meaning that they are strongly correlated with each other. This can reduce the interpretability of the model and make it difficult to determine the specific factors that are contributing to a team’s success or failure. In order to avoid collinearity and improve the interpretability of the model, it may be necessary to select a subset of predictors that capture the essence of the data while removing redundant or highly correlated predictors.

For example, the following predictors may be disregarded because they are highly correlated with each other or with other predictors of interest:

  • double_kills, triple_kills, quadra_kills, penta_kills, killing_sprees, largest_killing_spree, and unreal_kills: these predictors are highly correlated with each other, and they are also highly correlated with kills, gold_earned, and total_damage_dealt, which are our predictors of interest.

  • total_damage_dealt: this predictor is simply the sum of total_damage_dealt_to_champions and damage to non-champion entities (which can be inferred from other predictors such as neutral_minions_killed and dragon_kills). Because it is collinear with total_damage_dealt_to_champions, it is redundant and can be removed.

  • total_time_spent_dead: this predictor sums the time spent dead over the whole game, but it is highly correlated with the number of deaths and may not provide additional useful information.

  • gold_earned: this predictor aggregates the gold from kills, assists, minions, and buildings destroyed, but I am interested in these components individually rather than as a combined value.
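Pairs like these can also be flagged programmatically. A small base-R sketch (the 0.9 cutoff and the toy columns are my choices, not from the original analysis):

```r
# Flag pairs of numeric predictors whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.9) {
    cm <- cor(df[sapply(df, is.numeric)], use = "complete.obs")
    cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
    idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
    paste(rownames(cm)[idx[, "row"]], colnames(cm)[idx[, "col"]], sep = " ~ ")
}

set.seed(2)
toy <- data.frame(kills        = 1:20,
                  double_kills = (1:20) / 2,   # perfectly collinear with kills
                  deaths       = rnorm(20))    # pure noise
high_cor_pairs(toy)   # includes "kills ~ double_kills"
```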

The final list of variables becomes:

Variable Explanation
kills The number of enemy champions the player killed
assists The number of enemy champion kills the player assisted
deaths The number of times the player was killed
champ_level The champion’s level when the game ends
bounty_level The player’s bounty level; a higher bounty grants the opponent more gold for killing the player
objectives_stolen The number of dragons, rift heralds, and barons stolen
objectives_stolen_assists The number of objective steals the player assisted
vision_score The vision score earned from placing and countering wards
damage_dealt_to_buildings Damage dealt to turrets and inhibitors
first_blood_assist Whether the player assisted a teammate in getting the first blood
first_blood_kill Whether the player got the first blood kill
first_tower_assist Whether the player assisted a teammate in destroying the first tower
first_tower_kill Whether the player destroyed the first tower
inhibitor_takedowns The number of inhibitor destructions the player participated in
inhibitors_lost The number of inhibitors lost as a team
longest_time_spent_living The longest time the player survived between consecutive deaths
neutral_minions_killed The number of jungle monsters killed
time_c_cing_others Total time of crowd control cast on enemy champions
total_damage_dealt_to_champions Total damage dealt to enemy champions
total_damage_taken Total damage received from enemy champions
total_heal Total healing received or self-cast
total_heals_on_teammates Total healing cast on teammates
total_minions_killed Total minions killed
dragon_kills The number of dragons the player killed
baron_kills The number of barons the player killed
team Either the blue or red team

In addition to selecting a subset of predictors, it may also be useful to include interactions between certain predictors in the model. Interactions capture the enhancement effect that one predictor has on another, and can provide valuable information about how different factors influence each other. For example, the following interactions may be relevant to the analysis:

  • kills:champ_level: the more kills a player gets, the more experience they gain and the higher their level, which in turn makes it easier to kill more champions. This interaction captures the feedback loop between kills and champion level.

  • dragon_kills:champ_level: the team that kills a dragon gains bonuses to their stats, which can make it easier to kill more dragons and gain further advantages. This interaction captures the effect of dragon kills on champion level.

  • inhibitors_lost:deaths: the loss of an inhibitor can make it easier for the enemy team to attack and kill your champions. This interaction captures the relationship between losing an inhibitor and the number of deaths on a team.
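A toy logistic regression on simulated data (not the project’s dataset) shows how such an interaction term is specified and estimated:

```r
# Toy illustration: an interaction term in a logistic model lets the effect of
# kills on the win odds grow with champ_level. All data here is simulated.
set.seed(42)
n <- 500
kills <- rpois(n, 5)
champ_level <- sample(8:18, n, replace = TRUE)
logit <- -6 + 0.2 * kills + 0.25 * champ_level + 0.03 * kills * champ_level
win <- rbinom(n, 1, plogis(logit))

# glm() builds main effects plus the product term from `kills * champ_level`
fit <- glm(win ~ kills * champ_level, family = binomial)
summary(fit)$coefficients["kills:champ_level", ]
```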

Now I will take a look at some distributions in detail

KDA

KDA (kills, deaths, and assists) comprises the essential stats of the game and is often considered the most straightforward indicator of a player’s performance and skill
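A common convention (an assumption here, not a field in the dataset) computes the KDA ratio with deaths floored at 1 to avoid division by zero:

```r
# KDA ratio as commonly computed: (kills + assists) / deaths, with deaths
# clamped to at least 1 so a deathless game does not divide by zero.
kda <- function(kills, deaths, assists) (kills + assists) / pmax(deaths, 1)

kda(5, 0, 7)    # deathless game: (5 + 7) / 1 = 12
kda(4, 2, 2)    # (4 + 2) / 2 = 3
```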

Kill and Assist

The kill number distribution plot shows that, in general, more kills are associated with a higher likelihood of winning the game. This is especially true for players with more than 7 kills. However, it’s interesting to note that a significant number of players in the dataset have 0 kills and yet still manage to win the game. This suggests that there may be other factors at play, such as a more defensive playing style or a willingness to sacrifice personal kills in order to support the team.

The assist number distribution plot appears to have a similar shape, with more assists being associated with a higher likelihood of winning. This suggests that assists may play a similar role to kills in contributing to a team’s success. Overall, both kills and assists appear to be important factors in determining the outcome of a game.

ggplot(league_all_df_factored, aes(kills)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

ggplot(league_all_df_factored, aes(assists)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Death

The death number distribution follows the opposite pattern of the kill number distribution. The plot shows a consistent advantage for players with fewer deaths, with the highest win rate occurring for players with fewer than 3 deaths. This suggests that minimizing deaths is an important factor in achieving success in League of Legends. However, it’s also worth noting that games with low death counts make up a significant portion of the overall distribution, so achieving a low death count may not be as difficult as it appears from the plot.

ggplot(league_all_df_factored, aes(deaths)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Scaling stats

These stats are strongly related to the continuous improvement of a champion

Champion level

The champion level distribution plot suggests that players with a champion level below 14 are more likely to lose the game. However, after reaching level 14, the win ratio appears to remain consistent. This turning point is likely due to game design, as all players in a game typically reach level 14 and higher-level champions no longer have a sharp advantage over lower-level champions. This suggests that reaching level 14 is an important milestone in the game, and players should prioritize leveling up their champions in order to gain this advantage.

ggplot(league_all_df_factored, aes(champ_level)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Total Minion killed

The win rate distribution plot for number of minions killed shows that the win rate is consistent around 50% for numbers up to 250. The spike in the number of games at around 40 is likely due to players who take on a utility role, in which they do not take all the minions for themselves but instead share the gold with their teammates. It’s interesting to note that the win rate appears to decrease for numbers greater than 250. One possible explanation for this is that players who kill a large number of minions may restrict the scaling of their teammates’ champions, leading to a lower win rate. This suggests that there may be a trade-off between personal gold acquisition and team success in League of Legends.

ggplot(league_all_df_factored, aes(total_minions_killed)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Longest time spent living

The longest-time-spent-living distribution plot suggests that players are more likely to lose if they have shorter intervals between deaths. This may be because players who die frequently are unable to contribute as much to the team, limiting the team’s overall effectiveness. It’s also worth noting that the plot shows a peak at around 200 seconds, which may correspond to players who survive for moderately long stretches but still die regularly. Overall, this plot suggests that staying alive longer between deaths is an important factor in achieving success in League of Legends.

ggplot(league_all_df_factored, aes(longest_time_spent_living)) +
    geom_histogram(aes(fill = win), bins = 50) +
    scale_fill_manual(values = c("blue", "red"))

Building stats

Inhibitor lost

The inhibitor loss distribution plot suggests that players are more likely to lose if their team loses one or more inhibitors. This is consistent with the high correlation between win rate and inhibitors_lost, as losing inhibitors can make it easier for the enemy team to attack and defeat your team. The plot shows a steep drop in win rate for teams that lose one or more inhibitors, indicating that this is a critical factor in determining the outcome of a game. Overall, this plot supports the importance of protecting inhibitors in order to maximize your team’s chances of success.

ggplot(league_all_df_factored, aes(inhibitors_lost)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Damage to buildings

The damage-dealt-to-buildings distribution plot shows how structural damage is spread across players. Winning a game in which neither team surrenders requires destroying at least five turrets, one inhibitor, and the Nexus along one lane, so a team as a whole must deal a large amount of building damage, even though individual contributions vary widely. This plot provides insight into the overall strategy and pacing of a game, and can help teams plan their attack and defense in order to maximize their chances of victory.

mean(league_all_df_factored$damage_dealt_to_buildings)
## [1] 2650.275
ggplot(league_all_df_factored, aes(damage_dealt_to_buildings)) +
    geom_histogram(aes(fill = win), bins = 30) +
    scale_fill_manual(values = c("blue", "red"))

ggplot(league_all_df_factored, aes(damage_dealt_to_buildings)) +
    geom_freqpoly(aes(colour = win), bins = 30) +
    scale_colour_manual(values = c("blue", "red"))

Team Objectives

Dragon killed and baron killed

The dragon kills distribution plot suggests that the number of dragon kills does not have a significant impact on the win rate unless more than 2 dragons are killed. This may be because killing a single dragon provides only a small advantage, but killing multiple dragons can provide a significant boost to a team’s stats and make it easier to win the game. The plot also shows that most teams do not kill any barons, but that getting a baron greatly increases the likelihood of winning. This indicates that barons are powerful objectives that can provide a significant advantage to the team that controls them. Overall, this plot provides insight into the importance of objectives in League of Legends and can help teams plan their strategy accordingly.

ggplot(league_all_df_factored, aes(dragon_kills)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

ggplot(league_all_df_factored, aes(baron_kills)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Utility stats

Even though each team typically has one utility champion, other roles can still perform utility actions such as warding or healing.

Vision Score

The vision score distribution plot suggests that having a low vision score is strongly correlated with losing the game. However, the plot also shows that having a high vision score does not significantly affect the win rate. This may be because players are matched with opponents who have similar expertise in vision and map control, so having a higher vision score does not provide a significant advantage. The plot shows a steep drop in win rate for players with vision scores below 20, indicating that having adequate vision is crucial for success in League of Legends. Overall, this plot highlights the importance of vision and map control in the game, and can help teams prioritize these aspects of their strategy.

ggplot(league_all_df_factored, aes(vision_score)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Time of Crowd Control

The crowd control duration plot is a bit counterintuitive, as one might expect that teams with more crowd control duration on their opponents would have a higher win rate. However, the plot shows a peak in the win rate for teams with moderate amounts of crowd control duration, with the win rate decreasing for both higher and lower values. This may be because applying too much crowd control can limit a team’s ability to deal damage and secure objectives, while applying too little crowd control may not provide sufficient protection or disruption. The peak in the win rate at around 100 seconds of crowd control duration suggests that this may be an optimal amount for achieving success in League of Legends. Overall, this plot provides insight into the appropriate use of crowd control in the game, and can help teams plan their strategy accordingly.

ggplot(league_all_df_factored, aes(time_c_cing_others)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Other

Team

Blue team may have a slight edge over red team due to asymmetrical map design.

ggplot(league_all_df_factored, aes(team)) +
    geom_bar(aes(fill = win)) +
    scale_fill_manual(values = c("blue", "red"))

Train the models

Now, after considering the correlations and the individual distributions, let’s train the models with the final selection of predictors and add the interactions mentioned above.

And It’s important to split the data since validation allows us to assess the performance of the model on unseen data. This is important because it allows us to evaluate the model’s ability to generalize to new data, rather than simply fitting to the training data. By using a validation set, we can tune the model’s hyperparameters and ensure that it is not overfitting to the training data.

Additionally, stratifying the validation set on the “win” variable is important because it ensures that the distribution of wins and losses in the validation set is representative of the overall distribution in the dataset. This is important because it ensures that the validation set is a fair and unbiased representation of the data, and allows us to accurately evaluate the model’s performance on both winning and losing games. Overall, using a validation set and stratifying on the “win” variable are important steps in building a reliable and effective machine learning model.
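To make the stratification idea concrete, here is a minimal Python sketch (illustrative only; the analysis itself uses rsample's `initial_split()` with `strata = win`). The key point: taking the same proportion of each class preserves the win/loss ratio in both splits.

```python
from collections import Counter
import random

def stratified_split(rows, label_key, prop=0.8, seed=42):
    """Split rows into train/test while preserving the label distribution."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * prop)  # take `prop` of EACH class for training
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data: 60% wins, 40% losses
rows = [{"win": True}] * 60 + [{"win": False}] * 40
train, test = stratified_split(rows, "win")
print(Counter(r["win"] for r in train))  # 48 wins, 32 losses: the same 60/40 ratio
```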

df_split <- league_df_eda %>% select(c(
    kills,
    deaths,
    assists,
    champ_level,
    objectives_stolen,
    objectives_stolen_assists,
    baron_kills,
    dragon_kills,
    vision_score,
    damage_dealt_to_buildings,
    first_blood_assist,
    first_blood_kill,
    first_tower_assist,
    first_tower_kill,
    inhibitor_takedowns,
    inhibitors_lost,
    longest_time_spent_living,
    neutral_minions_killed,
    time_c_cing_others,
    total_damage_dealt_to_champions,
    total_damage_taken,
    total_heal,
    total_heals_on_teammates,
    total_minions_killed,
    total_time_spent_dead,
    team,
    win
)) %>% 
    initial_split(prop = 0.8, strata = win)
league_training <- training(df_split)
league_testing <- testing(df_split)

Now that the data has been split, it's time to train the following models:

  1. Logistic regression
  2. LDA
  3. QDA
  4. Elastic Net (with tuned penalty and mixture)
  5. Random Forest and boosted trees (with tuned hyperparameters)
  6. Support Vector Machine (with tuned cost)

Recipe building

Create a recipe and add a few interaction terms

league_recipe <- league_training %>% 
    recipe(win ~ .) %>%
        step_dummy(all_factor_predictors()) %>% 
        step_interact(terms = ~ kills:vision_score) %>%
        step_interact(terms = ~ dragon_kills:champ_level) %>%
        step_interact(terms = ~ inhibitors_lost:deaths) %>%
        step_normalize(all_numeric_predictors()) # Center and scale all numeric predictors

league_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         26
## 
## Operations:
## 
## Dummy variables from all_factor_predictors()
## Interactions with kills:vision_score
## Interactions with dragon_kills:champ_level
## Interactions with inhibitors_lost:deaths
## Centering and scaling for all_numeric_predictors()

Create cross-validation folds

Cross-validation is another critical step in tuning the parameters of a machine learning model. It involves dividing the training data into multiple folds, training the model on different combinations of folds, and evaluating its performance on each combination. This allows us to assess the model’s performance on different subsets of the data, and ensure that it is not overfitting to any particular subset. By using cross-validation, we can tune the model’s hyperparameters to optimize its performance on the training data.

In this case, we are using 5-fold cross-validation, which means that the training data is divided into 5 folds and the model is trained and evaluated on 5 different combinations of these folds. This allows us to assess the model’s performance on a wide range of data, while still reducing the computational cost compared to using a larger number of folds. Additionally, using 5 folds allows us to balance the trade-off between computational efficiency and the model’s ability to generalize to new data. Overall, cross-validation is an essential step in building a reliable and effective machine learning model.
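As a minimal illustration of what `vfold_cv(v = 5)` produces (ignoring stratification, which rsample adds on top): each observation lands in exactly one assessment fold, and each split trains on the other v − 1 folds.

```python
def vfold_indices(n, v=5):
    """Assign each of n observations to one of v assessment folds, round-robin."""
    folds = [[] for _ in range(v)]
    for i in range(n):
        folds[i % v].append(i)
    # Each split trains on v-1 folds and assesses on the remaining one
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

splits = vfold_indices(100, v=5)
for train_idx, assess_idx in splits:
    assert len(train_idx) == 80 and len(assess_idx) == 20
```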

league_folded <- league_training %>% 
    vfold_cv(v = 5, strata = win)

Logistic Regression

# set model 
log_reg <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

# setup workflow
log_workflow <- workflow() %>%
    add_model(log_reg) %>%
    add_recipe(league_recipe)


# fit the model
log_fit <- fit(log_workflow, league_training)

# generate roc_auc
log_roc_auc <- augment(log_fit, new_data = league_testing) %>%
    roc_auc(truth = win, estimate = .pred_TRUE)

LDA and QDA

lda_reg <- discrim_linear() %>%
    set_mode("classification") %>%
    set_engine("MASS")

lda_workflow <- workflow() %>% 
    add_model(lda_reg) %>%
    add_recipe(league_recipe)

lda_fit <- lda_workflow %>% 
    fit(league_training)

lda_roc_auc <- augment(lda_fit, new_data = league_testing) %>% 
    roc_auc(truth = win, estimate = .pred_TRUE)

qda_reg <- discrim_quad() %>%
    set_mode("classification") %>%
    set_engine("MASS")

qda_workflow <- workflow() %>% 
    add_model(qda_reg) %>%
    add_recipe(league_recipe)

qda_fit <- qda_workflow %>% 
    fit(league_training)

qda_roc_auc <- augment(qda_fit, new_data = league_testing) %>% 
    roc_auc(truth = win, estimate = .pred_TRUE)

Elastic Net

# Prepare to tune the penalty and mixture parameters.
# The outcome is binary, so logistic regression with glmnet gives us the elastic net.
elastic_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
    set_engine("glmnet") %>%
    set_mode("classification")

elastic_workflow <- workflow() %>%
    add_recipe(league_recipe) %>%
    add_model(elastic_spec)

# Tuning grid: penalty on a log10 scale, mixture from ridge (0) to lasso (1)
penalty_grid <- grid_regular(penalty(c(-5, 5)), mixture(c(0, 1)), levels = 10)
elastic_res <- tune_grid(
    elastic_workflow,
    resamples = league_folded,
    grid = penalty_grid,
)

# Save the tuning results to an RDS file so they can be reloaded later
saveRDS(elastic_res, "save/elastic_res.rds")

The tuning results show that the regularization penalty should not exceed 1; larger penalties degrade the ROC AUC.

# Load from already computed results
elastic_res <- readRDS("save/elastic_res.rds")

autoplot(elastic_res)

# Select the best
elastic_best <- select_best(elastic_res, metric = "roc_auc")

# Fit with the best params
elastic_final_fit <- finalize_workflow(elastic_workflow, elastic_best) %>% 
    fit(league_training)


elastic_roc_auc <- augment(elastic_final_fit, new_data = league_testing) %>% 
    roc_auc(truth = win, estimate = .pred_TRUE)

elastic_roc_auc
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.979

Random Forest and boosting

For the random forest, I fixed mtry at 5, which is close to the square root of the number of predictors (√26 ≈ 5.1), the commonly recommended value for classification.
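As a quick sanity check on that heuristic (a standalone illustration, not part of the R pipeline; other libraries, such as scikit-learn's classifier forests, default to the same √p rule):

```python
import math

n_predictors = 26  # predictors fed into the recipe
mtry = round(math.sqrt(n_predictors))
print(mtry)  # 5, matching rand_forest(mtry = 5)
```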

rf_spec <- rand_forest(mtry = 5, trees = tune(), min_n = tune()) %>% 
    set_engine("ranger", importance = "impurity") %>% 
    set_mode("classification")

rf_workflow <- workflow() %>% 
    add_recipe(league_recipe) %>% 
    add_model(rf_spec)
rf_grid <- grid_regular(trees(), min_n(), levels = 5)

rf_res <- tune_grid(
    rf_workflow,
    resamples = league_folded,
    grid = rf_grid,
    control = control_grid(verbose = TRUE)
)

saveRDS(rf_res, "save/rf_res.rds")
rf_res <- readRDS("save/rf_res.rds")
rf_final_fit <- finalize_workflow(rf_workflow, select_best(rf_res, metric = "roc_auc")) %>% 
    fit(league_training)

rf_final_fit

saveRDS(rf_final_fit, "save/rf_final_fit.rds")
rf_final_fit <- readRDS("save/rf_final_fit.rds")

rf_roc_auc <- augment(rf_final_fit, new_data = league_testing) %>% 
    roc_auc(truth = win, estimate = .pred_TRUE)

boost_spec <- boost_tree(trees = tune(), tree_depth = tune()) %>% 
    set_engine("xgboost") %>% 
    set_mode("classification")

boost_grid <- grid_regular(trees(), tree_depth(c(2, 8)), levels = 5)

boost_workflow <- workflow() %>%
    add_recipe(league_recipe) %>%
    add_model(boost_spec)
boost_res <- tune_grid(
    boost_workflow,
    resamples = league_folded,
    grid = boost_grid,
)

saveRDS(boost_res, "save/boost_res.rds")

The tree depth here doesn’t matter too much, but we do need 500 trees.

boost_res <- readRDS("save/boost_res.rds")
autoplot(boost_res)

select_best(boost_res, metric = "roc_auc")
## # A tibble: 1 × 3
##   trees tree_depth .config              
##   <int>      <int> <chr>                
## 1   500          2 Preprocessor1_Model02
boost_final_fit <- finalize_workflow(boost_workflow, select_best(boost_res, metric = "roc_auc")) %>% 
    fit(league_training)


boost_roc_auc <- augment(boost_final_fit, new_data = league_testing) %>% 
    roc_auc(truth = win, estimate = .pred_TRUE)

SVM

svm_spec <- svm_poly(degree = 1, cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab", scaled = FALSE)

svm_workflow <- workflow() %>% 
    add_recipe(league_recipe) %>% 
    add_model(svm_spec)
svm_grid <- grid_regular(cost(c(-5, 5)), levels = 5)

svm_res <- tune_grid(
    svm_workflow,
    resamples = league_folded,
    grid = svm_grid,
    control = control_grid(verbose = TRUE)
)

saveRDS(svm_res, "save/svm_res.rds")

The roc_auc doesn't change much once the cost exceeds 0.125.

svm_res <- readRDS("save/svm_res.rds")
autoplot(svm_res)

svm_final_fit <- finalize_workflow(svm_workflow, select_best(svm_res, metric = "roc_auc")) %>% 
    fit(league_training)

saveRDS(svm_final_fit, "save/svm_final_fit.rds")
svm_final_fit <- readRDS("save/svm_final_fit.rds")

svm_roc_auc <- augment(svm_final_fit, new_data = league_testing) %>% 
    roc_auc(truth = win, estimate = .pred_TRUE)

Results

Now that all the models have been trained, it's time to gather the ROC AUC scores and compare them.

model_names <- c("LogisticRegression", "LDA", "QDA", "ElasticNet", "RandomForest", "Boost", "SupportVectorMachine")

model_roc_aucs <- c(
    log_roc_auc$.estimate,
    lda_roc_auc$.estimate,
    qda_roc_auc$.estimate,
    elastic_roc_auc$.estimate,
    rf_roc_auc$.estimate,
    boost_roc_auc$.estimate,
    svm_roc_auc$.estimate
)

# Combine the two vectors into a data frame
all_roc_aucs <- tibble(model_name = model_names, roc_auc = model_roc_aucs)

boost_roc_auc
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.985
# Plot with reordered bars
all_roc_aucs %>% 
    ggplot(aes(x = reorder(model_name, roc_auc), y = roc_auc)) +
    geom_col(width = 0.2) +
    theme(text = element_text(size = 12)) +
    xlab("Models") + ylab("ROC AUC") +
    geom_text(aes(label = round(roc_auc, 3)), position = position_dodge(0.9), vjust = -0.25)

All the models except QDA have an ROC AUC above 0.97, and the best is the boosted tree model, with 500 trees and a tree depth of 2.

For people who don't know, the ROC AUC score is a measure of a model's ability to distinguish between positive and negative examples, with a score of 1.0 indicating perfect separation and a score of 0.5 indicating random guessing. Our best score of 0.985 is well above the random-guessing baseline, and indicates that the model is able to make highly accurate predictions.
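Equivalently, ROC AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal hand-rolled sketch of that pairwise definition (toy scores, not our model's predictions):

```python
def roc_auc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation gives 1.0; identical score lists give 0.5
assert roc_auc([0.9, 0.8], [0.2, 0.1]) == 1.0
assert roc_auc([0.5, 0.7], [0.5, 0.7]) == 0.5
```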

augment(boost_final_fit, new_data = league_testing) %>% 
    roc_curve(truth = win, estimate = .pred_TRUE) %>% 
    autoplot()

And the accuracy and the confusion matrix on the held-out test data:

augment(boost_final_fit, new_data = league_testing) %>% 
    accuracy(truth = win, estimate = .pred_class) 
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.938
augment(boost_final_fit, new_data = league_testing) %>% 
    conf_mat(truth = win, estimate = .pred_class) %>% 
    autoplot()

The results are pretty good, with nearly 0.94 accuracy.
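For readers unfamiliar with these metrics, the computation `yardstick` performs here can be sketched by hand (toy labels, not our test set): count (truth, prediction) pairs, and accuracy is the diagonal's share of the total.

```python
from collections import Counter

def confusion_and_accuracy(truth, pred):
    """Count (truth, prediction) pairs and derive accuracy from the diagonal."""
    cm = Counter(zip(truth, pred))
    correct = sum(n for (t, p), n in cm.items() if t == p)
    return cm, correct / len(truth)

truth = [True, True, False, False, True]
pred  = [True, False, False, True, True]
cm, acc = confusion_and_accuracy(truth, pred)
print(acc)  # 0.6, since 3 of 5 predictions fall on the diagonal
```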

There are several reasons why a boosting model might produce the best ROC AUC score compared to other models such as logistic regression, LDA, QDA, elastic net, SVM, and random forest.

One reason is that boosting algorithms are ensemble models, which means that they combine the predictions of multiple weak models to produce a more accurate and robust prediction. This can help to reduce overfitting and improve the model’s ability to generalize to new data.

Another reason is that boosting algorithms typically use decision trees as the weak models, which are powerful and flexible models that can capture complex nonlinear relationships in the data. This can allow the boosting model to make more accurate predictions, especially for datasets with complex and high-dimensional features.
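To make the "many weak models" point concrete, here is a minimal, self-contained sketch of gradient boosting with decision stumps on a toy 1-D regression problem (illustrative only; xgboost's actual algorithm adds regularization, column subsampling, and second-order gradients):

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best reduces squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_trees=50, lr=0.3):
    """Each stump is fitted to the residual errors of the current ensemble."""
    stumps = []
    pred = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]  # a step function the ensemble recovers quickly
model = boost(xs, ys)
```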

The low ROC AUC score of QDA in the analysis is likely due to its assumption that the predictors follow a multivariate normal distribution within each class, which is often violated in real-world data; in addition, its class-specific covariance matrices require estimating many more parameters, making it prone to overfitting. This limits the model's ability to accurately predict the outcome of games, and explains why it performed worse than the other models in the analysis.

Did our assumptions hold, and how can we improve our win rate?

The variable importance plots from the random forest fit provide valuable insights into the factors that contribute most to the win rate in League of Legends. The plots show that our first assumption - that kills are a major factor in determining the outcome of a game - is not entirely accurate, as kills rank below more than 10 other variables. Our second assumption - that first blood is a key predictor of success - is also incorrect: it is one of the least important contributors to the win rate, so even if a team gets first blood, they should remain cautious.

The plots also support our third assumption that vision score is an important factor in determining the outcome of a game. Interestingly, the interaction between kills and vision score appears to have a greater impact on the win rate than either variable individually. This suggests that players should prioritize both kills and vision in order to maximize their chances of success.

Finally, our fourth assumption - that stolen objectives are important for winning games - is proven to be wrong by the plots, which show that stolen objectives have little correlation with the win rate. This may be because players who are forced to steal objectives are often lacking in gold and experience, which makes it difficult for them to fight effectively in team battles. Overall, the variable importance plots provide valuable insights into the factors that drive success in League of Legends, and can help players prioritize their actions in order to maximize their chances of winning.

rf_final_fit %>% extract_fit_parsnip() %>% vip(num_features = 40)

boost_final_fit %>% extract_fit_parsnip() %>% vip(num_features = 40)

Testing New data

What if we had some fresh data from games I just played? To grab my own match data, I created a Python script that pulls it from the Riot API.

import requests

PUUID = "secret"
API_KEY = "secret"

def get_header():
    return  {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,zh-Hant;q=0.8,zh-Hans;q=0.7,zh;q=0.6",
        "Accept-Charset": "application/x-www-form-urlencoded; charset=UTF-8",
    }
def get_match_ids(puuid=PUUID, count=20, start=0, api_key=API_KEY):
    # set up the url
    url = f"https://americas.api.riotgames.com/lol/match/v5/matches/by-puuid/{puuid}/ids?count={count}&start={start}&api_key={api_key}"
    # make the request
    response = requests.get(url, headers=get_header())
    # get the json data
    return response.json()

def get_match_data(match_id, api_key=API_KEY):
    url = f"https://americas.api.riotgames.com/lol/match/v5/matches/{match_id}?api_key={api_key}"
    response = requests.get(url, headers=get_header())
    return response.json()

# Fetch the 20 most recent match IDs, then pull full data for the first two
match_ids = get_match_ids(PUUID, 20, 0, API_KEY)
match_data_1 = get_match_data(match_ids[0], API_KEY)
match_data_2 = get_match_data(match_ids[1], API_KEY)



class Participant:
    assists: int
    baronKills: int
    bountyLevel: int
    champExperience: int
    champLevel: int
    championId: int
    championName: str
    championTransform: int
    consumablesPurchased: int
    damageDealtToBuildings: int
    damageDealtToObjectives: int
    damageDealtToTurrets: int
    damageSelfMitigated: int
    deaths: int
    detectorWardsPlaced: int
    doubleKills: int
    dragonKills: int
    firstBloodAssist: bool
    firstBloodKill: bool
    firstTowerAssist: bool
    firstTowerKill: bool
    gameEndedInEarlySurrender: bool
    gameEndedInSurrender: bool
    goldEarned: int
    goldSpent: int
    individualPosition: str
    inhibitorKills: int
    inhibitorTakedowns: int
    inhibitorsLost: int
    item0: int
    item1: int
    item2: int
    item3: int
    item4: int
    item5: int 
    item6: int
    itemsPurchased: int
    killingSprees: int
    kills: int
    lane: str
    largestCriticalStrike: int
    largestKillingSpree: int
    largestMultiKill: int
    longestTimeSpentLiving: int
    magicDamageDealt: int
    magicDamageDealtToChampions: int
    magicDamageTaken: int
    neutralMinionsKilled: int
    nexusKills: int
    nexusTakedowns: int
    objectivesStolen: int
    objectivesStolenAssists: int
    participantId: int
    pentaKills: int
    perks: dict
    physicalDamageDealt: int
    physicalDamageDealtToChampions: int
    physicalDamageTaken: int
    profileIcon: int
    puuid: str
    quadraKills: int
    riotIdName: str
    riotIdTagline: str
    role: str
    sightWardsBoughtInGame: int
    spell1Casts: int
    spell2Casts: int
    spell3Casts: int
    spell4Casts: int
    summoner1Casts: int
    summoner1Id: int
    summoner2Casts: int
    summoner2Id: int
    summonerId: str
    summonerLevel: int
    summonerName: str
    teamEarlySurrendered: bool
    teamId: int
    teamPosition: str
    timeCCingOthers: int
    timePlayed: int
    totalDamageDealt: int
    totalDamageDealtToChampions: int
    totalDamageShieldedOnTeammates: int
    totalDamageTaken: int
    totalHeal: int
    totalHealsOnTeammates: int
    totalMinionsKilled: int
    totalTimeCCDealt: int
    totalTimeSpentDead: int
    totalUnitsHealed: int
    tripleKills: int
    trueDamageDealt: int
    trueDamageDealtToChampions: int
    trueDamageTaken: int
    turretKills: int
    turretTakedowns: int
    turretsLost: int
    unrealKills: int
    visionScore: int
    visionWardsBoughtInGame: int
    wardsKilled: int    
    wardsPlaced: int
    win: bool

    def __init__(self, participant_data):
        # Every annotated field shares its name with a key in the API response,
        # so the one-assignment-per-line boilerplate collapses into a loop.
        for field in self.__annotations__:
            setattr(self, field, participant_data[field])

class MatchData:
    game_type: str
    game_duration: int
    gameMode: str
    mapId: int
    participant: list
    teams: list
    def __init__(self, match_data):
        self.game_type = match_data["info"]["gameType"]
        self.game_duration = match_data["info"]["gameDuration"]
        self.gameMode = match_data["info"]["gameMode"]
        self.mapId = match_data["info"]["mapId"]
        self.teams = match_data["info"]["teams"]

        self.participant = []
        for participant in match_data["info"]["participants"]:
            self.participant.append(Participant(participant))


def get_participant_stats(game_data, puuid=PUUID):
    for participant in game_data.participant:
        if participant.puuid == puuid:
            return participant

def print_stats(stat: Participant):
    # Emit one comma-separated line in the same column order as the R training data
    fields = [
        stat.kills, stat.deaths, stat.assists, stat.champLevel,
        stat.objectivesStolen, stat.objectivesStolenAssists,
        stat.baronKills, stat.dragonKills, stat.visionScore,
        stat.damageDealtToBuildings, stat.firstBloodAssist, stat.firstBloodKill,
        stat.firstTowerAssist, stat.firstTowerKill,
        stat.inhibitorTakedowns, stat.inhibitorsLost,
        stat.longestTimeSpentLiving, stat.neutralMinionsKilled,
        stat.timeCCingOthers, stat.totalDamageDealtToChampions,
        stat.totalDamageTaken, stat.totalHeal, stat.totalHealsOnTeammates,
        stat.totalMinionsKilled, stat.totalTimeSpentDead,
        "blue" if stat.teamId == 100 else "red", stat.win,
    ]
    print(",".join(str(field) for field in fields))


game_1_stat = get_participant_stats(MatchData(match_data_1))
game_2_stat = get_participant_stats(MatchData(match_data_2))

print_stats(game_1_stat)
print_stats(game_2_stat)

The above code generates the following strings, after replacing a few words. And yes, the model gives the correct prediction!

obs1 <- "8,7,4,16,0,0,0,0,6,2189,False,False,False,False,0,3,495,1,2,27427,28297,458,0,174,227,blue,False"
obs2 <- "4,3,4,12,0,0,0,0,9,1750,False,False,False,False,1,0,784,1,3,11734,8750,392,0,104,62,blue,True"


# Get the dataframe from text
obs1 <- read.table(text = obs1, sep = ",", col.names = colnames(league_training))
obs2 <- read.table(text = obs2, sep = ",", col.names = colnames(league_training))


# Factorization
new_testing <- bind_rows(obs1, obs2)
new_testing$win <- factor(as.logical(new_testing$win))
new_testing$first_tower_kill <- factor(as.logical(new_testing$first_tower_kill))
new_testing$first_tower_assist <- factor(as.logical(new_testing$first_tower_assist))
new_testing$first_blood_kill <- factor(as.logical(new_testing$first_blood_kill))
new_testing$first_blood_assist <- factor(as.logical(new_testing$first_blood_assist))
new_testing$team <- factor(new_testing$team)

# Predict the new data
augment(rf_final_fit, new_data = new_testing) %>% 
    select(win, .pred_class)
## # A tibble: 2 × 2
##   win   .pred_class
##   <fct> <fct>      
## 1 FALSE FALSE      
## 2 TRUE  TRUE

Extra: Is predicting the win rate before the game starts possible?

Even though the above models offer excellent prediction accuracy, they are based on end-game stats and may not be useful for players who want to know their chances of winning before the game starts. In these cases, players only have information about the champions that will be in the game, but not about how well they and their opponents will perform. Using only this information, it is difficult to accurately predict the outcome of a game.

In order to test the feasibility of using only champion information to predict the win rate, I built a DNN model. However, the results were not as good as I had hoped, with a validation accuracy of only 51.62%, barely better than random guessing. This is likely because the game data was all recorded on the same day, so regular champion balancing tends to even out the win rates. Overall, it seems that using only champion information is not a reliable way to predict the outcome of a game.

The extra analysis can be found in analysis-extra.html in the same folder; it is written in Python.

Conclusion

The analysis suggests that the best model to predict the win rate of a League of Legends game based on end-game stats is a boosting model. This is likely due to the fact that boosting algorithms are ensemble models that combine the predictions of multiple weak models, and are able to capture complex nonlinear relationships in the data.

It is worth noting that the analysis only used a subset of the available data, and the performance of the model may improve if more of the dataset is used. However, even with the limited data, the model achieved an impressive ROC AUC score of over 0.98, indicating that it is highly accurate at predicting the outcome of games.

Still, there are several ways that the analysis and modeling process could be further improved. For example, the analysis could be extended to include more predictors, such as player skill level, game mode, and the specific champions used by each player. This could provide additional insights into the factors that influence the win rate, and allow the model to make more accurate predictions.

Another potential improvement would be to use more advanced techniques for feature engineering and selection. For example, the analysis could incorporate techniques such as dimensionality reduction, feature selection algorithms, and interaction terms, to identify the most important features and improve the model’s predictive power.

Additionally, the analysis could be extended to include more advanced models, such as deep learning neural networks or gradient boosting machines, which are known to perform well on complex and high-dimensional datasets. This could further improve the model’s performance and enable it to make even more accurate predictions.

Overall, there are many potential ways to improve the analysis and modeling process, and further exploration and experimentation could provide valuable insights into the factors that influence the win rate in League of Legends.